Date: 12/05/2018
App: https://srinivasanswaminathan.shinyapps.io/SLCPD/
An ideal situation for any law enforcement agency is to prevent a crime from happening in the first place. In the unfortunate event of a crime, the agency’s immediate objective is to intervene as soon as possible to defuse the situation. A critical component of achieving these objectives is the presence of police in crime hotspots: visible police presence both deters crime and improves response times.
The app can be used by civilians to view the crime hotspots on a map of Salt Lake City. This will be useful for tourists or new residents of the city to be aware of neighborhoods with a history of crime. These can be important factors in choosing an apartment to rent, a house to buy, or places to hang out during a visit.
In addition, law enforcement agencies can use this app for the following purposes:
* Deploy police in crime hotspots to both deter crime as well as improve response times
* Split the police grids into various clusters based on the nature of crime.
* Recruitment, training, personnel transfers, and exchange of best practices can be tailored to the cluster groups
* City councils can implement anti-crime programs as well as perform analysis for policy research
library(shiny)
library(tidyverse)
library(dplyr)
library(skimr)
library(DT)
library(ggplot2)
library(plotly)
library(leaflet)
library(lubridate)
library(sp)
library(rgdal)
library(NbClust)
library(reshape2)
library(factoextra)
library(formattable)
Source: https://opendata.utah.gov/Public-Safety/SALT-LAKE-CITY-POLICE-CASES-2016/trgz-4r9d
We used publicly available crime data from the SLCPD for the year 2016. While most of the variables are self-explanatory, the less obvious ones are:
Case: Unique case ID
NCIC Code: National Crime Information Center code
IBR: Incident-Based Reporting code
X.Coordinate: State Plane NAD83 X coordinate
Y_Coordinate: State Plane NAD83 Y coordinate
options(stringsAsFactors = FALSE)
crime_data<-read.csv("C:\\Users\\swati\\Google Drive\\Analytical Applications\\project2\\SALT_LAKE_CITY_-_POLICE_CASES_-2016.csv",encoding ='ASCII')
names(crime_data)[1]<-"CASE"
str(crime_data)
## 'data.frame': 61240 obs. of 14 variables:
## $ CASE : chr "SL201624" "SL201637" "SL201656" "SL201648" ...
## $ NCIC.CODE : chr "2499-13" "5404-0" "5404-0" "5499-44" ...
## $ DESCRIPTION : chr "STOLEN VEHICLE" "DUI ALCOHOL" "DUI ALCOHOL" "ALCOHOL IN VEH" ...
## $ IBR : chr "240" "90D" "90D" "90D" ...
## $ OCCURRED : chr "01/01/2016 12:36:15 AM" "01/01/2016 01:00:24 AM" "01/01/2016 01:23:44 AM" "01/01/2016 01:10:27 AM" ...
## $ REPORTED : chr "01/01/2016 12:36:15 AM" "01/01/2016 01:00:24 AM" "01/01/2016 01:23:44 AM" "01/01/2016 01:10:27 AM" ...
## $ DAY.OF.WEEK : int 6 6 6 6 6 6 6 6 6 6 ...
## $ LOCATION : chr "1400 S EMERY ST " "200 W 500 S " "400 S WEST TEMPLE ST " "" ...
## $ CITY : chr "SALT LAKE CITY" "SALT LAKE CITY" "SALT LAKE CITY" "SALT LAKE CITY" ...
## $ CITY.COUNCIL: chr "2" "4" "4" "SL" ...
## $ POLICE.ZONE : chr "Z2" "Z3" "Z3" "40" ...
## $ POLICE.GRID : chr "123" "134" "135" "40" ...
## $ X.COORDINATE: int 1882550 1890252 1891039 1869739 1891819 1894816 1893431 1903627 1882890 1895533 ...
## $ Y_COORDINATE: int 876155 883639 883995 890270 880413 884414 879640 891202 879376 882346 ...
After closely inspecting the data in the view table, we executed the following steps to clean it up:
* Omit NA values
* Remove empty values for the variables LOCATION, CITY.COUNCIL, and OCCURRED
* Subset the data to include police zones Z1 to Z6 only, as the rest of the values were inaccurate
* Convert time values into the relevant format
* Convert categorical variables into factors
crime_data<-na.omit(crime_data)
crime_data <- crime_data[-which(crime_data$LOCATION == ""), ]
crime_data<-crime_data[-which(crime_data$CITY.COUNCIL==""),]
crime_data <- crime_data[-which(crime_data$OCCURRED == ""), ]
crime_data<-subset(crime_data,POLICE.ZONE %in% c("Z1","Z2","Z3","Z4","Z5","Z6"))
crime_data$OCCURRED<- mdy_hms(crime_data$OCCURRED)
crime_data$REPORTED<- mdy_hms(crime_data$REPORTED)
cols<- c("NCIC.CODE","DESCRIPTION","IBR","DAY.OF.WEEK","LOCATION","CITY","CITY.COUNCIL", "POLICE.ZONE", "POLICE.GRID" )
crime_data[cols]<-lapply(crime_data[cols], factor)
str(crime_data)
## 'data.frame': 43006 obs. of 14 variables:
## $ CASE : chr "SL201624" "SL201637" "SL201656" "SL201660" ...
## $ NCIC.CODE : Factor w/ 339 levels "1004-0","1005-0",..: 101 262 262 262 260 262 253 336 36 36 ...
## $ DESCRIPTION : Factor w/ 58 levels "ALCOHOL IN VEH",..: 48 13 13 13 24 13 44 43 4 4 ...
## $ IBR : Factor w/ 48 levels "0","09A","100",..: 19 42 42 42 48 42 1 1 6 6 ...
## $ OCCURRED : POSIXct, format: "2016-01-01 00:36:15" "2016-01-01 01:00:24" ...
## $ REPORTED : POSIXct, format: "2016-01-01 00:36:15" "2016-01-01 01:00:24" ...
## $ DAY.OF.WEEK : Factor w/ 7 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ LOCATION : Factor w/ 3998 levels "000 100 S ",..: 1242 2067 2972 3938 2578 3899 2921 805 2863 1218 ...
## $ CITY : Factor w/ 3 levels "MURRAY","SALT LAKE CITY",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ CITY.COUNCIL: Factor w/ 9 levels "0","1","2","3",..: 3 5 5 6 5 6 4 3 5 6 ...
## $ POLICE.ZONE : Factor w/ 6 levels "Z1","Z2","Z3",..: 2 3 3 5 4 5 4 2 4 5 ...
## $ POLICE.GRID : Factor w/ 23 levels "111","112","113",..: 6 11 12 18 14 18 13 5 16 18 ...
## $ X.COORDINATE: int 1882550 1890252 1891039 1891819 1894816 1893431 1903627 1882890 1895533 1889454 ...
## $ Y_COORDINATE: int 876155 883639 883995 880413 884414 879640 891202 879376 882346 876317 ...
The data variables are now in the appropriate formats. Now let’s skim the data to check for completeness.
skim(crime_data)
## Skim summary statistics
## n obs: 43006
## n variables: 14
##
## -- Variable type:character --------------------------------------------------------------------------------
## variable missing complete n min max empty n_unique
## CASE 0 43006 43006 8 12 0 43003
##
## -- Variable type:factor -----------------------------------------------------------------------------------
## variable missing complete n n_unique
## CITY 0 43006 43006 3
## CITY.COUNCIL 0 43006 43006 9
## DAY.OF.WEEK 0 43006 43006 7
## DESCRIPTION 0 43006 43006 58
## IBR 0 43006 43006 48
## LOCATION 0 43006 43006 3998
## NCIC.CODE 0 43006 43006 339
## POLICE.GRID 0 43006 43006 23
## POLICE.ZONE 0 43006 43006 6
## top_counts ordered
## SAL: 43004, MUR: 1, SOU: 1, NA: 0 FALSE
## 4: 16810, 5: 7635, 2: 6527, 7: 3774 FALSE
## 6: 6795, 4: 6494, 5: 6439, 3: 6260 FALSE
## LAR: 6990, PUB: 5846, DRU: 4221, ASS: 3593 FALSE
## 0: 12577, 90Z: 5417, 23C: 3079, 13B: 2636 FALSE
## 200: 1333, 300: 1238, 200: 1109, 300: 562 FALSE
## 490: 3113, 230: 3079, 131: 2230, 549: 1982 FALSE
## 132: 7205, 152: 5171, 162: 2123, 111: 2079 FALSE
## Z3: 14495, Z5: 7131, Z6: 5884, Z1: 5783 FALSE
##
## -- Variable type:integer ----------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25
## X.COORDINATE 0 43006 43006 1889922.83 8703.42 1843802 1886204
## Y_COORDINATE 0 43006 43006 882400.59 7072.58 862241 876879.5
## p50 p75 p100 hist
## 1889516 1894609 1916369 ▁▁▁▂▇▇▂▁
## 884010 886684 913504 ▁▃▃▇▃▁▁▁
##
## -- Variable type:POSIXct ----------------------------------------------------------------------------------
## variable missing complete n min max median n_unique
## OCCURRED 0 43006 43006 2016-01-01 2017-01-04 2016-07-04 42964
## REPORTED 0 43006 43006 2016-01-01 2017-01-04 2016-07-04 42964
Now let’s add a few more columns for analysis, such as the weekday on which the crime occurred, the hour of occurrence, etc.
crime_data$occurencemonth<-month(crime_data$OCCURRED)
crime_data$occurencehour<-hour(crime_data$OCCURRED)
crime_data$DAY.OF.WEEK<-weekdays.POSIXt(crime_data$OCCURRED,abbreviate = TRUE)
crime_data$DAY.OF.WEEK<-as.factor(crime_data$DAY.OF.WEEK)
crime_data$occurencemonth<-as.factor(crime_data$occurencemonth)
crime_data$occurencehour<-as.factor(crime_data$occurencehour)
The UCR Program collects statistics on the number of offenses known to law enforcement. In the traditional Summary Reporting System (SRS), there are eight crimes, or Part I offenses (murder and nonnegligent homicide, rape (legacy and revised), robbery, aggravated assault, burglary, motor vehicle theft, larceny-theft, and arson), to be reported to the UCR Program. These offenses were chosen because they are serious crimes, they occur with regularity in all areas of the country, and they are likely to be reported to police. The Part I offenses are defined as: criminal homicide, forcible rape/legacy rape, revised rape, robbery, aggravated assault, burglary (breaking or entering), larceny-theft (except motor vehicle theft), motor vehicle theft, and arson.
Our data has around 58 reported categories of crime. For the efficiency of our cluster analysis, we will form bins for the eight major Part I crimes plus drug offenses, and bucket the rest as Other crimes.
# Offense Table based on UCR type I crimes
Homicide<-"90"
Rape<-"11"
Robbery<-"12"
Assault<-"13"
Burglary<-"22"
Larceny<-"23"
Auto.Theft<-"24"
Arson<-"20"
Drugs<-"35"
crime_data$OffenseType = case_when(substr(crime_data$NCIC.CODE,1,2) %in% Homicide ~ "Homicide",
substr(crime_data$NCIC.CODE,1,2) %in% Rape ~ "Rape",
substr(crime_data$NCIC.CODE,1,2) %in% Robbery ~ "Robbery",
substr(crime_data$NCIC.CODE,1,2) %in% Assault ~ "Assault",
substr(crime_data$NCIC.CODE,1,2) %in% Burglary ~ "Burglary",
substr(crime_data$NCIC.CODE,1,2) %in% Larceny ~ "Larceny",
substr(crime_data$NCIC.CODE,1,2) %in% Auto.Theft ~ "Auto.Theft",
substr(crime_data$NCIC.CODE,1,2) %in% Arson ~ "Arson",
substr(crime_data$NCIC.CODE,1,2) %in% Drugs ~ "Drugs",
TRUE ~ "Others"
)
crime_data$OffenseType<-as.factor(crime_data$OffenseType)
group_by_crimetype<- group_by(crime_data,DESCRIPTION)
count_by_crimetype<- summarise(group_by_crimetype,n=n())
ggplot(data=subset(count_by_crimetype,n>200), aes(x=reorder(DESCRIPTION,-n),y=n))+
geom_bar(stat="identity",fill="steelblue")+
geom_text(aes(label=n),vjust=0.3,size=2.5)+
coord_flip()+
xlab("Crime Description") +
ylab("Count of Occurences in 2016")
#ggplotly(plot)%>% layout( autosize=F)%>% config(displayModeBar = F)
The graph shows 21 of the 58 categories of crime: only crimes with more than 200 occurrences in 2016 are included. We see that larceny, assault, and drug offenses are among the most frequently reported crimes in 2016.
SLC has six police zones, each divided into police sub-grids. We analyze which crimes are most frequently reported in each police zone.
group_by_policezone<- group_by(subset(crime_data,OffenseType!="Others"),POLICE.ZONE,OffenseType)
count_by_policezone<- summarise(group_by_policezone,n=n())
df_sorted <- arrange(count_by_policezone, POLICE.ZONE, OffenseType)
p2<-ggplot(data=subset(df_sorted,n>50), aes(x=reorder(POLICE.ZONE,n),y=n,fill=OffenseType))+
geom_bar(stat="identity")+
coord_flip()+
xlab("Police Zone") +
ylab("Count of Occurences in 2016")+
theme_minimal()
ggplotly(p2)%>% config(displayModeBar = F)
p1<-ggplot(crime_data,aes(x=OffenseType, fill=DESCRIPTION))+
geom_bar()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p1)%>% config(displayModeBar = F)
# K-means clustering
library(NbClust)
#Convert data into numerical data
library(reshape2)
Crime_Clust_Location<-dcast(subset(crime_data,OffenseType!="Others"),formula = POLICE.GRID~OffenseType,fun.aggregate = length, value.var = "OffenseType")
rownames(Crime_Clust_Location) <- Crime_Clust_Location[,1]
Crime_Clust_Location[,1] <- NULL
Crime_Clust<-scale(Crime_Clust_Location)
head(Crime_Clust)
## Arson Assault Auto.Theft Burglary Drugs Homicide
## 111 1.4649377 1.22286642 2.2096010 0.97368999 -0.21348827 1.5095603
## 112 1.4649377 0.77059019 0.4942174 -0.33086553 -0.08685172 -0.4193223
## 113 -1.4230823 -0.05858288 0.4528828 -0.79677822 -0.27884907 -0.4193223
## 121 -0.4604090 0.76116777 1.6102501 0.01080377 -0.21348827 -0.4193223
## 122 0.5022643 -0.45432457 0.1222065 -0.57935230 -0.27067897 -0.4193223
## 123 -1.4230823 -0.29414341 -0.2291371 -1.10738667 -0.30948695 -0.4193223
## Larceny Robbery
## 111 -0.1783812 0.04134139
## 112 0.2252782 1.36196920
## 113 -0.5158669 -0.59255995
## 121 -0.3769022 -0.27560928
## 122 -0.5787319 -0.53973484
## 123 -0.2842591 -0.43408462
The data is reshaped to include only the major crime categories, and the row names are renamed to the police grid numbers. The data is then scaled across all major offenses.
The elbow method is based on total within-cluster variation: the algorithm attempts to minimize the total within-cluster sum of squares.
library(factoextra)
fviz_nbclust(Crime_Clust,kmeans,method="wss")+
labs(subtitle = "Elbow Method")
The silhouette method measures how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
fviz_nbclust(Crime_Clust,kmeans,method="silhouette")+
labs(subtitle = "Silhouette Method")
Here, we choose the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1.
fviz_nbclust(Crime_Clust,kmeans,method="gap_stat",nboot = 50)+
labs(subtitle = "Gap statistic method")
The last two methods clearly identify 2 as the optimal number of clusters.
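As a cross-check, the one-standard-deviation rule for the gap statistic can also be applied directly with `clusGap()` from the cluster package. The sketch below uses a small simulated matrix rather than `Crime_Clust`, so it only illustrates the mechanics, not our actual result:

```r
library(cluster)
set.seed(123)

# Two well-separated simulated groups, standing in for the scaled crime matrix
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
             matrix(rnorm(60, mean = 5), ncol = 3))

# Bootstrap the gap statistic for k = 1..5
gs <- clusGap(toy, FUNcluster = kmeans, nstart = 25, K.max = 5, B = 50)

# Smallest k with gap(k) >= gap(k+1) - SE(k+1), i.e. Tibshirani's rule
k <- maxSE(gs$Tab[, "gap"], gs$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k
```

`fviz_nbclust(..., method = "gap_stat")` wraps this same computation; calling `clusGap()` directly exposes the gap and standard-error columns if you want to inspect the rule by hand.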
### Generate Cluster Centers
clusters<-kmeans(Crime_Clust,2,nstart=25)
clusters$centers
## Arson Assault Auto.Theft Burglary Drugs Homicide
## 1 -0.0478347 -0.2367115 -0.1090701 -0.06462971 -0.2304121 -0.1437677
## 2 0.5022643 2.4854709 1.1452365 0.67861196 2.4193265 1.5095603
## Larceny Robbery
## 1 -0.2231272 -0.2655474
## 2 2.3428358 2.7882472
ClusterResult<-data.frame(clusters$centers)
summarizeClusterResult<-function(colno){
case_when(
ClusterResult[,colno]<quantile(ClusterResult[,colno],0.25)~"Low",
ClusterResult[,colno]>=quantile(ClusterResult[,colno],0.25)& ClusterResult[,colno]<quantile(ClusterResult[,colno],0.75)~"Medium",
ClusterResult[,colno]>=quantile(ClusterResult[,colno],0.75)~"High"
)
}
for (i in 1:8 ){
ClusterResult[,i]<-summarizeClusterResult(i)
}
library(formattable)
# One color formatter shared by all offense columns:
# High = red, Medium = orange, Low = green
level_color <- formatter("span", style = x ~ ifelse(
  x == "High", "color:red", ifelse(
    x == "Medium", "color:orange", "color:green"
  )))
ClusterResult <- formattable(
  ClusterResult,
  setNames(rep(list(level_color), ncol(ClusterResult)), names(ClusterResult))
)
ClusterResult
| | Arson | Assault | Auto.Theft | Burglary | Drugs | Homicide | Larceny | Robbery |
|---|---|---|---|---|---|---|---|---|
| 1 | Low | Low | Low | Low | Low | Low | Low | Low |
| 2 | High | High | High | High | High | High | High | High |
The table above shows which crime types dominate each cluster. We will now add the cluster assignments to our earlier table of crime frequencies by police grid.
Crime_Clust_Location$Cluster <- (clusters$cluster)
The graph below shows which grids fall into which cluster.
str(clusters)
## List of 9
## $ cluster : Named int [1:23] 1 1 1 1 1 1 1 1 2 1 ...
## ..- attr(*, "names")= chr [1:23] "111" "112" "113" "121" ...
## $ centers : num [1:2, 1:8] -0.0478 0.5023 -0.2367 2.4855 -0.1091 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:8] "Arson" "Assault" "Auto.Theft" "Burglary" ...
## $ totss : num 176
## $ withinss : num [1:2] 86.3 24.8
## $ tot.withinss: num 111
## $ betweenss : num 64.8
## $ size : int [1:2] 21 2
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
fviz_cluster(clusters, data = Crime_Clust_Location,
repel = TRUE)
Now let’s assign the clusters back to the parent crime_data table. But first, let’s convert the State Plane coordinates into latitude and longitude.
library(rgdal)
nad83_coords <- data.frame(lon=crime_data$X.COORDINATE,lat=crime_data$Y_COORDINATE)
coordinates(nad83_coords) <- c("lon", "lat")
proj4string(nad83_coords)=CRS("+init=esri:102285")
x<-data.frame(spTransform(nad83_coords,CRS("+init=epsg:4326")))
crime_data$lon<-x$lon
crime_data$lat<-x$lat
Crime_Clust_Location$POLICE.GRID<-rownames(Crime_Clust_Location)
Crime_Clust_Location$POLICE.GRID<-as.factor(Crime_Clust_Location$POLICE.GRID)
crime_data$cluster<-Crime_Clust_Location$Cluster[match(crime_data$POLICE.GRID,Crime_Clust_Location$POLICE.GRID)]
crime_data$cluster <- paste("Cluster",crime_data$cluster, sep=" ")
crime_data$cluster<-as.factor(crime_data$cluster)
names(crime_data)
## [1] "CASE" "NCIC.CODE" "DESCRIPTION" "IBR"
## [5] "OCCURRED" "REPORTED" "DAY.OF.WEEK" "LOCATION"
## [9] "CITY" "CITY.COUNCIL" "POLICE.ZONE" "POLICE.GRID"
## [13] "X.COORDINATE" "Y_COORDINATE" "occurencemonth" "occurencehour"
## [17] "OffenseType" "lon" "lat" "cluster"
group_by_cluster<- group_by(subset(crime_data,OffenseType!="Others"),cluster,OffenseType)
count_by_cluster<- summarise(group_by_cluster,n=n())
df_sorted <- arrange(count_by_cluster, cluster, OffenseType)
p4<-ggplot(data=subset(df_sorted), aes(x=reorder(cluster,n),y=n,fill=OffenseType))+
geom_bar(stat="identity")+
coord_flip()+
xlab("Crime Cluster") +
ylab("Count of Occurences in 2016")+
theme_minimal()
ggplotly(p4)%>% config(displayModeBar = F)
ggplot(subset(crime_data,OffenseType!="Others"),aes(x=DAY.OF.WEEK,fill=OffenseType))+
geom_bar()+
facet_wrap(~cluster,ncol = 1)
We chose to use a logit model ahead of SVC because we wanted to find areas with a high probability of crime for a specific hour and day of the week. However, upon running logistic regression with a 70-30 train/test split, the top 5 police grids for a specific crime remained the same irrespective of the hour of the day or the day of the week. It is possible that further predictor variables are required to build an effective prediction model. Hence, we restricted our analysis to clustering.
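For reference, here is a minimal sketch of the kind of logistic-regression setup described above, run on simulated data rather than the SLCPD data. The column names `occurencehour` and `DAY.OF.WEEK` mirror crime_data, but the binary outcome `in_grid` (whether a case falls in a given police grid) and all values here are synthetic:

```r
set.seed(42)

# Simulated stand-in for crime_data: hour, weekday, and a binary outcome
n <- 2000
sim <- data.frame(
  occurencehour = factor(sample(0:23, n, replace = TRUE)),
  DAY.OF.WEEK   = factor(sample(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"),
                                n, replace = TRUE)),
  in_grid       = rbinom(n, 1, 0.3)
)

# 70-30 train/test split
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Logit: probability of a case in the grid given hour and weekday
fit  <- glm(in_grid ~ occurencehour + DAY.OF.WEEK,
            data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")

# Highest predicted probabilities on the held-out 30%
head(sort(pred, decreasing = TRUE), 5)
```

In our actual experiment the model was fit per crime type and the grids were ranked by predicted probability; since hour and weekday alone barely changed the ranking, richer predictors would be needed before this approach becomes useful.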